
Add HydraGNN distributed training recipe for AMD MI355X #32

Merged: ashwinma merged 4 commits into main from hydragnn-distributed-training on May 20, 2026

ashwinma (Collaborator) commented on May 19, 2026

Summary

  • Adds sbatch_train_amd.sh for multi-node HydraGNN GFM training on AMD Instinct MI355X via SLURM + Apptainer (MPI-enabled ADIOS, no DDStore)
  • Overhauls build_overlay_amd.sh to compile adios2 from source with MPI, pin HydraGNN SHA, and use node-local scratch for build I/O
  • Documents the full workflow in recipes/train/README.md with validated performance numbers from a 50-batch sanity test (1 node, 8 GPUs)

What's included

| File | Change |
| --- | --- |
| examples/sbatch_train_amd.sh | New: SLURM batch script with embedded rank script, RCCL multi-node env vars, monkey-patch for avg_num_neighbors |
| examples/build_overlay_amd.sh | Rewritten: MPI-enabled adios2, node-local build, vesin/e3nn deps, pinned SHA |
| examples/run_train.sh | Updated: now a Docker/single-GPU entrypoint; defers to sbatch for HPC |
| recipes/train/README.md | Complete rewrite: env var reference, launch diagram, validated metrics |
| model.yaml | Updated container image (rocm7.2.2), added AI4S_SHARED_DIR and training env vars |
| .cursor/skills/ai4science-studio/SKILL.md | Lessons: --gpus-per-node requirement, PMIx shared memory fix |
| .cursor/skills/ai4science-material-science/SKILL.md | HydraGNN multi-node training pattern |
| .claude/commands/init-cluster.md | Added scratch_local, RCCL socket_ifname, IB HCA discovery |

Validated on

  • Cluster: Vultr Lux (MI355X gfx950)
  • Config: 1 node, 8 GPUs, batch_size=200, fp64, ANI1x + Alexandria datasets
  • Result: 50 batches in ~130s (2.4 s/batch steady state), 7.5 GB peak VRAM, no RCCL errors

Key design decisions

  1. Rank script is embedded in the batch script (heredoc) — not a separate repo file. Generated at runtime to $HG_OUTPUT_DIR/hydragnn-rank-<jobid>.sh for debuggability.
  2. No DDStore (phase 1) — each rank opens ADIOS files directly via MPI communicator. DDStore deferred to phase 2 for 32+ node scale.
  3. Monkey-patch avg_num_neighbors — injects precomputed value (13.74) to skip expensive full-dataset neighbor-degree scan at init.
  4. SLURM env vars not explicitly passed — Apptainer inherits the host environment; only script-computed or defaulted vars use --env.
  5. Multi-node RCCL vars gated on NODES > 1 — single-node runs don't need IB/network config.
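
Design decision 1 can be sketched as follows. This is a minimal illustration, not the actual contents of sbatch_train_amd.sh: the fallback values for HG_OUTPUT_DIR and SLURM_JOB_ID, the JOB_ID and RANK_SCRIPT names, and the body of the generated script are all assumptions, chosen so the snippet also runs outside a SLURM allocation.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed fallbacks so the sketch also runs outside a SLURM allocation.
HG_OUTPUT_DIR="${HG_OUTPUT_DIR:-$(mktemp -d)}"
JOB_ID="${SLURM_JOB_ID:-local}"
RANK_SCRIPT="${HG_OUTPUT_DIR}/hydragnn-rank-${JOB_ID}.sh"

mkdir -p "${HG_OUTPUT_DIR}"

# Quoted delimiter ('EOF'): nothing is expanded while the file is written,
# so the variables below are resolved later, inside each rank.
cat > "${RANK_SCRIPT}" <<'EOF'
#!/usr/bin/env bash
# Placeholder body; a real rank script would launch training here.
echo "rank ${SLURM_PROCID:-0} of job ${SLURM_JOB_ID:-local}"
EOF
chmod +x "${RANK_SCRIPT}"

echo "generated ${RANK_SCRIPT}"
```

Because the generated file stays in the output directory after the job ends, it can be inspected or rerun by hand when a rank misbehaves.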

Test plan

  • 50-batch sanity test (1 node / 8 GPUs) — passed
  • Multi-node (2+ nodes) validation
  • Full-epoch training run
  • Overlay build from scratch on a fresh node

ashwinma and others added 4 commits May 19, 2026 02:14
Adds sbatch_train_amd.sh for multi-node/multi-GPU HydraGNN GFM training
using MPI-enabled ADIOS2 multi-dataset loading (no DDStore) on AMD
Instinct MI355X via Apptainer.

Key design decisions validated on Lux cluster:
- Use --gpus-per-node=8 (not --gpus-per-task=1) to allow RCCL full
  topology discovery via KFD sysfs; per-task isolation causes
  "Could not read node" RCCL errors on MI355X
- Single-node only needs HSA_NO_SCRATCH_RECLAIM=1 (wiki-documented)
- Multi-node RCCL env vars (IB HCA, socket ifname, etc.) are
  parameterized for site-specific override
- Monkey-patch AdiosMultiDataset.avg_num_neighbors to skip expensive
  full-dataset degree scan at init
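
The gating of network settings on node count, with site-specific overrides, could look roughly like the sketch below. NCCL_SOCKET_IFNAME and NCCL_IB_HCA are standard NCCL/RCCL variables, but the function name rccl_env_for, the default interface and HCA names, and the choice to keep HSA_NO_SCRATCH_RECLAIM=1 for multi-node runs are assumptions for illustration, not values taken from this commit.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the environment assignments a run with the given node count would use.
# The fallback interface and HCA names are placeholders, not values from the PR.
rccl_env_for() {
  local nodes="$1"
  # Documented as the only setting a single-node run needs;
  # assumed here to be kept for multi-node runs as well.
  echo "HSA_NO_SCRATCH_RECLAIM=1"
  if (( nodes > 1 )); then
    # Network settings only matter across nodes; each one is overridable
    # by exporting the variable before the script runs.
    echo "NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-eth0}"
    echo "NCCL_IB_HCA=${NCCL_IB_HCA:-hca0}"
  fi
}

NODES="${SLURM_JOB_NUM_NODES:-1}"
while IFS= read -r assignment; do
  export "${assignment}"
done < <(rccl_env_for "${NODES}")

echo "configured for ${NODES} node(s)"
```

Keeping the assignments in one function makes it easy to print them for a given node count before submitting a job.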

Also updates:
- build_overlay_amd.sh: MPI-enabled adios2 build from source with cmake
- model.yaml: add training recipe and AI4S_SHARED_DIR env var
- recipes/train/README.md: full runbook
- init-cluster: add scratch_local and RCCL network discovery
- Studio + material science skills: GPU visibility lesson, PMIx fix

Co-authored-by: Cursor <cursoragent@cursor.com>

50-batch sanity test passed on MI355X (1 node, 8 GPUs):
- 2.4 s/batch steady state, 130s total training time
- 7.5 GB peak allocated / 9.0 GB reserved per GPU
- No RCCL errors, all ranks converged
- Environment: fp64, batch_size=200, ANI1x+Alexandria datasets

Also adds HYDRAGNN_MAX_NUM_BATCH, HYDRAGNN_VALTEST, SCRATCH_LOCAL
to the environment variable reference table.

Co-authored-by: Cursor <cursoragent@cursor.com>

Apptainer inherits the host process environment by default, so SLURM_JOB_ID,
SLURM_JOB_NUM_NODES, SLURM_PROCID, and SLURM_CPUS_PER_TASK are already
set by srun's PMIx launcher in each rank's environment. Explicit --env
lines for these were misleading (implied they wouldn't propagate).
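
A dry-run sketch of the resulting launch line is shown below. It only assembles and prints the command; nothing is executed. The image path, the default values, and the decision to pass HG_OUTPUT_DIR with --env are assumptions for illustration, while apptainer exec, --rocm, and --env are existing Apptainer options.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Script-computed or defaulted values: these get explicit --env flags.
# HYDRAGNN_MAX_NUM_BATCH appears in the PR; the default of 50 is an assumption.
HYDRAGNN_MAX_NUM_BATCH="${HYDRAGNN_MAX_NUM_BATCH:-50}"
HG_OUTPUT_DIR="${HG_OUTPUT_DIR:-/tmp/hg-output}"

# Hypothetical image path, for illustration only.
IMAGE="${IMAGE:-hydragnn.sif}"

cmd=(apptainer exec --rocm
     --env "HYDRAGNN_MAX_NUM_BATCH=${HYDRAGNN_MAX_NUM_BATCH}"
     --env "HG_OUTPUT_DIR=${HG_OUTPUT_DIR}"
     "${IMAGE}" bash "${HG_OUTPUT_DIR}/hydragnn-rank-${SLURM_JOB_ID:-local}.sh")

# No --env for SLURM_JOB_ID, SLURM_PROCID, and friends: they are inherited
# from the host environment that srun sets up for each rank.
# Dry run: print the command instead of executing it.
printf '%q ' "${cmd[@]}"; echo
```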

Co-authored-by: Cursor <cursoragent@cursor.com>

Previously, RCCL fell back to socket transport because the ANP plugin
and libionic were not bind-mounted into the container (--rocm does not
expose them). This adds the required bind-mounts and MPI ob1/tcp
configuration for Pensando/ionic fabrics.

Key changes:
- Bind-mount librccl-anp.so and libionic.so.1 from host into container
- Add MPI ob1/tcp transport (ionic /31 subnets don't route for verbs)
- Read all cluster-specific values from .cluster-config.yaml (gitignored)
- Add network fields to .cluster-config.example.yaml
- Add inference recipe and convergence tracking tooling
- Remove all cluster-specific names from committed files

Validated: 2-node/16-GPU training with NCCL_DEBUG=INFO confirms
RCCL-ANP plugin loaded, all 8 ionic HCAs active, GDRDMA channels
established for cross-node GPU allreduce.
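
The bind-mount step can be sketched as a helper that emits a --bind argument for each library present on the host. The two library names come from the commit message; the function name bind_args_for, the HOST_LIB_DIR and CTR_LIB_DIR variables, and both default directories are hypothetical, since the real paths are site-specific. How the container's loader is pointed at the mounted files is not covered here.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Emit "--bind" and "host:container" (one per line) for each library that
# exists on the host. Arguments: host library dir, container library dir.
bind_args_for() {
  local host_dir="$1" ctr_dir="$2" lib
  for lib in librccl-anp.so libionic.so.1; do
    if [[ -e "${host_dir}/${lib}" ]]; then
      printf '%s\n' "--bind" "${host_dir}/${lib}:${ctr_dir}/${lib}"
    else
      echo "warning: ${host_dir}/${lib} not found; RCCL may fall back to sockets" >&2
    fi
  done
}

# Both directories are placeholders; real values would be site-specific.
HOST_LIB_DIR="${HOST_LIB_DIR:-/usr/lib/x86_64-linux-gnu}"
CTR_LIB_DIR="${CTR_LIB_DIR:-/opt/host-libs}"

BIND_ARGS=()
while IFS= read -r arg; do
  BIND_ARGS+=("${arg}")
done < <(bind_args_for "${HOST_LIB_DIR}" "${CTR_LIB_DIR}")

# Dry run: show where the arguments would go on the launch line.
echo "apptainer exec --rocm ${BIND_ARGS[*]:-} <image> <command>"
```

Warning on a missing library, rather than failing silently, surfaces the socket-transport fallback described above at launch time.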

Co-authored-by: Cursor <cursoragent@cursor.com>
ashwinma merged commit c0dafba into main on May 20, 2026 (2 checks passed)
ashwinma deleted the hydragnn-distributed-training branch on May 20, 2026 21:10